{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e508229b",
   "metadata": {},
   "source": [
    "\n",
    "# Biostatistical Hypothesis Testing\n",
    "\n",
    "This notebook demonstrates hypothesis testing techniques on a clinical dataset: a two-sample t-test, a chi-squared test of independence, simple linear regression, and logistic regression. A download link for the dataset is provided below.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b291e7e",
   "metadata": {},
   "source": [
    "\n",
    "## Dataset Information and Download Links\n",
    "\n",
    "The examples in this notebook use a **Diabetes dataset**, which can be downloaded from the following source:\n",
    "\n",
    "1. **Kaggle:**\n",
    "   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)\n",
    "   - This dataset includes clinical data for diabetes prediction and analysis.\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- **Pregnancies**: Number of pregnancies.\n",
    "- **Glucose**: Plasma glucose concentration.\n",
    "- **BloodPressure**: Diastolic blood pressure (mm Hg).\n",
    "- **SkinThickness**: Triceps skinfold thickness (mm).\n",
    "- **Insulin**: 2-hour serum insulin (μU/mL).\n",
    "- **BMI**: Body mass index (weight in kg / (height in m)^2).\n",
    "- **DiabetesPedigreeFunction**: Diabetes pedigree function (a score of diabetes likelihood based on family history).\n",
    "- **Age**: Age of the patient (years).\n",
    "- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Preprocess the dataset as needed (e.g., handle missing values). Check for physiologically implausible zeros in columns such as Glucose, BloodPressure, and BMI, which may indicate unrecorded measurements.\n",
    "- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information.\n",
    "        "
   ]
  },
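  {
   "cell_type": "markdown",
   "id": "a3f1c2d4",
   "metadata": {},
   "source": [
    "\n",
    "### Example: Marking Placeholder Zeros as Missing\n",
    "\n",
    "The preprocessing advice in the usage notes can be sketched as follows. This is a minimal illustration, not the dataset authors' procedure: it assumes that a zero in a clinical measurement column means the value was not recorded, and it uses a small hypothetical DataFrame (`example`) rather than the real file. The column list `zero_as_missing` is likewise an assumption to adapt to your data.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e2d9a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical rows for illustration only; not taken from the real dataset.\n",
    "example = pd.DataFrame({\n",
    "    'Glucose': [148, 0, 183],\n",
    "    'BMI': [33.6, 26.6, 0.0],\n",
    "    'Outcome': [1, 0, 1],\n",
    "})\n",
    "\n",
    "# Assumption: a zero in these columns marks an unrecorded measurement.\n",
    "zero_as_missing = ['Glucose', 'BMI']\n",
    "cleaned = example.copy()\n",
    "cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)\n",
    "\n",
    "# Count missing values per column after the replacement\n",
    "print(cleaned.isna().sum())\n",
    "        "
   ]
  },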
  {
   "cell_type": "markdown",
   "id": "517c3779",
   "metadata": {},
   "source": [
    "\n",
    "## T-Test: Comparing Glucose Levels Between Diabetic and Non-Diabetic Groups\n",
    "\n",
    "An independent two-sample t-test compares the mean glucose level of diabetic patients with that of non-diabetic patients. The null hypothesis is that the two group means are equal; a small p-value is evidence against it.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65b88037",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "from scipy.stats import ttest_ind\n",
    "\n",
    "# Load the dataset (replace with your file path)\n",
    "data = pd.read_csv(r'C:\\Path\\to\\diabetes.csv')\n",
    "\n",
    "# Separate diabetic and non-diabetic groups (Outcome: 1 = diabetic, 0 = non-diabetic)\n",
    "diabetic = data[data['Outcome'] == 1]\n",
    "non_diabetic = data[data['Outcome'] == 0]\n",
    "\n",
    "# Perform an independent two-sample t-test on glucose levels.\n",
    "# ttest_ind assumes equal group variances by default; pass equal_var=False for Welch's t-test.\n",
    "t_statistic, p_value = ttest_ind(diabetic['Glucose'], non_diabetic['Glucose'])\n",
    "\n",
    "print(f\"t-statistic: {t_statistic:.3f}\")\n",
    "print(f\"p-value: {p_value:.3g}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd274db4",
   "metadata": {},
   "source": [
    "\n",
    "## Chi-Squared Test: Diabetes and Elevated BMI\n",
    "\n",
    "A chi-squared test of independence evaluates whether diabetes status is associated with elevated BMI, defined here as a BMI of 25 or higher. The null hypothesis is that the two variables are independent.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff896bcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from scipy.stats import chi2_contingency\n",
    "\n",
    "# Flag elevated BMI (BMI >= 25) with a vectorized comparison\n",
    "data['Elevated_BMI'] = data['BMI'].ge(25).map({True: 'Yes', False: 'No'})\n",
    "\n",
    "# Cross-tabulate diabetes status against elevated BMI\n",
    "contingency_table = pd.crosstab(data['Outcome'], data['Elevated_BMI'])\n",
    "\n",
    "# Perform the chi-squared test of independence.\n",
    "# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
    "chi2, p, dof, expected = chi2_contingency(contingency_table)\n",
    "\n",
    "print(f\"Chi-squared value: {chi2:.3f}\")\n",
    "print(f\"P-value: {p:.3g}\")\n",
    "print(f\"Degrees of freedom: {dof}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9666bb4d",
   "metadata": {},
   "source": [
    "\n",
    "## Linear Regression: Glucose vs Age\n",
    "\n",
    "A simple linear regression models glucose level (the response) as a linear function of age (the predictor). The slope estimates the average change in glucose per additional year of age.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93064ac5",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.formula.api as smf\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Fit an ordinary least squares model with Glucose as the response and Age as the predictor\n",
    "linreg = smf.ols(formula='Glucose ~ Age', data=data).fit()\n",
    "\n",
    "# Print regression summary\n",
    "print(linreg.summary())\n",
    "\n",
    "# Plot the observations with the fitted line and its 95% confidence band\n",
    "sns.regplot(x='Age', y='Glucose', data=data, ci=95)\n",
    "plt.title(\"Linear Regression: Glucose vs Age\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b242c7",
   "metadata": {},
   "source": [
    "\n",
    "## Logistic Regression: Predicting Diabetes Using Glucose and BMI\n",
    "\n",
    "A logistic regression model estimates the probability of diabetes (Outcome = 1) from glucose level and BMI. Each coefficient is the change in the log-odds of diabetes per unit increase in that predictor.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c7e7e8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.api as sm\n",
    "\n",
    "# Build the design matrix; add_constant prepends an intercept column named 'const'\n",
    "X = sm.add_constant(data[['Glucose', 'BMI']])\n",
    "\n",
    "# Fit the logistic regression of Outcome on Glucose and BMI\n",
    "logit_model = sm.Logit(data['Outcome'], X).fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model.summary())\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
